CID problem
I installed PDFMiner with pip to extract text from a Japanese PDF and the following problem occurred.
I(cid:888), Intellectual Production Techniques(cid:887)Good(cid:845)Reference(cid:853)Greed(cid:864)(cid:845)(cid:880)(cid:866). People(cid:884)intellectual production techniques(cid:923)teaching(cid:849)(cid:916).
Solution: see Text extraction from PDF.
-----
Notes on the research process
Still have issues with CID Characters · Issue #39 · euske/pdfminer
2014
Pointing out that CMap needs to be reworked.
pdfminer - PDF text extraction returns wrong characters due to ToUnicode map - Stack Overflow
2015
Talking about [ToUnicode map
python - What to do with CIDs in text extracted by PDFMiner? - Stack Overflow
Talking about ToUnicode map
How can I extract embedded fonts from a PDF as valid font files? - Stack Overflow
Can embedded fonts be extracted? Discussion
Extracting Japanese text from PDF with PDFMiner
Reports of two-point stools and others being replaced by CID
I tried to use Python to extract text data from PDF (pdfminer.six) - Arakan "BOKU"'s IT Daily Life
2018
Example of importing and using from within a script rather than from the command line
This one, like my environment, has hiragana as CID.
---
This page is auto-translated from /nishio/CID問題 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.